Loan Approval Status

Abstract

Leveraging a Kaggle dataset comprising 4269 records and 13 columns, our project endeavors to forecast individual loan approval status. The dataset incorporates applicant details like education, self-employment status, annual income, CIBIL score, and asset values. Our in-depth analysis, documented on GitHub using R Markdown, employs diverse visualizations, statistical methods, and modeling techniques. The resulting insights have the potential to assist both lenders and borrowers by addressing requirements and reducing rejection rates.

Introduction

Our project, driven by personal experiences as international students, explores the financial challenges of pursuing academic dreams abroad. Having navigated the complexities of loan applications, particularly for education, we recognize broader implications across various markets. The global Personal Loan market, valued at $47.79 billion in 2020, is projected to reach $719.31 billion by 2030. The Global Student Loan market, at $3.93 trillion in 2021, is expected to grow to $8.75 trillion by 2031 (8.7% CAGR). The Global Automotive Finance market, valued at $259.84 billion in 2022, foresees a steady 7.3% CAGR from 2023 to 2030. Simultaneously, the Global Home Loan market, at $4.52 trillion in 2021, is set to soar to $33.3 trillion by 2031 (22.3% CAGR). Additionally, the global FinTech lending market, valued at $449.89 billion in 2020, is projected to reach $4,957.16 billion by 2030. Through our project, we aim to provide valuable insights into loan dynamics, potentially enhancing application efficiency and success rates globally.

Literature Survey

In this literature survey, we explore existing research on loan approval prediction. Reviewing traditional methods, machine learning models, and recent trends informs our study and helps identify gaps and challenges. Sheikh et al. [1] performed a machine-learning-based analysis of loan approval using several models, including Support Vector Machine and Logistic Regression. Ndayisenga [2] applied algorithms such as Gradient Boosting and Random Forest and concluded that credit score strongly influences the likelihood of loan approval. Additionally, Murthy et al. [3] analyzed the probability of loan approval using KNN and Decision Tree, and built a dedicated portal for quick decision making.

Data and Methodology

The project “Loan Approval Dataset” utilizes a comprehensive dataset sourced from Kaggle, featuring records of 4269 applicants. This dataset includes attributes such as Education, No. of Dependents, Self-Employment, Annual Income, Value of Assets, CIBIL Score, and Loan status.

The goal of this project is to analyze applicant records and identify the factors contributing to loan approval.

  • Data Preparation

    • Data Collection: The dataset, sourced from the reputable data-sharing platform Kaggle, provides a robust repository of loan applicant records.

    • Data Cleaning: We performed data cleansing, addressing null values, removing irrelevant columns, validating data types, and ensuring overall dataset consistency for enhanced accuracy and reliability.

  • Methodological Approach

    • Descriptive Statistics: The dataset underwent initial exploration through the computation of summary statistics, shedding light on the educational status, employment details, CIBIL Score, and asset values of applicants. The loan status variable takes two categories: Approved and Rejected.

    • Visualization Techniques: Diverse visualization functions were employed to craft informative charts and graphs, facilitating the effective presentation of findings and enhancing the comprehension of complex patterns.

    • Hypothesis Testing: Various statistical tests were conducted to validate hypotheses.

    • Correlation Analysis: Correlation techniques were applied to examine the relationships between variables concerning loan status.

    • Variable Factorization: Qualitative variables were converted to factors using the as.factor() function so that R treats them as categorical in tests and models.

Data preprocessing

Importing the Data

Structure of the data

## 'data.frame':    4269 obs. of  13 variables:
##  $ loan_id                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ no_of_dependents        : int  2 0 3 3 5 0 5 2 0 5 ...
##  $ education               : chr  " Graduate" " Not Graduate" " Graduate" " Graduate" ...
##  $ self_employed           : chr  " No" " Yes" " No" " No" ...
##  $ income_annum            : int  9600000 4100000 9100000 8200000 9800000 4800000 8700000 5700000 800000 1100000 ...
##  $ loan_amount             : int  29900000 12200000 29700000 30700000 24200000 13500000 33000000 15000000 2200000 4300000 ...
##  $ loan_term               : int  12 8 20 8 20 10 4 20 20 10 ...
##  $ cibil_score             : int  778 417 506 467 382 319 678 382 782 388 ...
##  $ residential_assets_value: int  2400000 2700000 7100000 18200000 12400000 6800000 22500000 13200000 1300000 3200000 ...
##  $ commercial_assets_value : int  17600000 2200000 4500000 3300000 8200000 8300000 14800000 5700000 800000 1400000 ...
##  $ luxury_assets_value     : int  22700000 8800000 33300000 23300000 29400000 13700000 29200000 11800000 2800000 3300000 ...
##  $ bank_asset_value        : int  8000000 3300000 12800000 7900000 5000000 5100000 4300000 6000000 600000 1600000 ...
##  $ loan_status             : chr  " Approved" " Rejected" " Rejected" " Rejected" ...

Description of the variables:

  • loan_id: Loan Application Id

  • no_of_dependents: Applicant Dependents

  • education: Graduate/Not Graduate

  • self_employed: Yes/No

  • income_annum: Annual Income

  • loan_amount: Loan Value

  • loan_term: Loan Term in Years

  • cibil_score: Credit Score

  • residential_assets_value: Value of Residential Assets

  • commercial_assets_value : Value of Commercial Assets

  • luxury_assets_value : Value of Luxury Assets

  • bank_asset_value: Value of Bank Assets

  • loan_status: Approved / Rejected

These variables are further classified as:

  • Qualitative: education, self_employed, loan_status

  • Quantitative: no_of_dependents, income_annum, loan_amount, loan_term, cibil_score, residential_assets_value, commercial_assets_value, luxury_assets_value, bank_asset_value

We excluded the loan_id variable using the subset() function, considering its minimal contribution to data analysis.
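The cleaning steps described so far (dropping loan_id, trimming stray whitespace, and converting the qualitative columns to factors) can be sketched as follows. This is a minimal illustration on a small made-up data frame, not the report's actual code; the object name `loans` and the sample values are assumptions.

```r
# Tiny made-up data frame mimicking the layout shown by str(); values are illustrative only.
loans <- data.frame(
  loan_id       = 1:4,
  education     = c(" Graduate", " Not Graduate", " Graduate", " Graduate"),
  self_employed = c(" No", " Yes", " No", " No"),
  cibil_score   = c(778L, 417L, 506L, 467L),
  loan_status   = c(" Approved", " Rejected", " Rejected", " Rejected"),
  stringsAsFactors = FALSE
)

# Drop the identifier column, which carries no predictive information.
loans <- subset(loans, select = -loan_id)

# Trim the leading spaces and convert the qualitative columns to factors.
qualitative <- c("education", "self_employed", "loan_status")
loans[qualitative] <- lapply(loans[qualitative], function(x) as.factor(trimws(x)))

str(loans)
```

Trimming before calling as.factor() keeps the factor levels free of the leading space visible in the raw character columns.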

Summary of the data

##  no_of_dependents         education    self_employed  income_annum    
##  Min.   :0.0       Graduate    :2144    No :2119     Min.   : 200000  
##  1st Qu.:1.0       Not Graduate:2125    Yes:2150     1st Qu.:2700000  
##  Median :3.0                                         Median :5100000  
##  Mean   :2.5                                         Mean   :5059124  
##  3rd Qu.:4.0                                         3rd Qu.:7500000  
##  Max.   :5.0                                         Max.   :9900000  
##   loan_amount         loan_term     cibil_score  residential_assets_value
##  Min.   :  300000   Min.   : 2.0   Min.   :300   Min.   : -100000        
##  1st Qu.: 7700000   1st Qu.: 6.0   1st Qu.:453   1st Qu.: 2200000        
##  Median :14500000   Median :10.0   Median :600   Median : 5600000        
##  Mean   :15133450   Mean   :10.9   Mean   :600   Mean   : 7472617        
##  3rd Qu.:21500000   3rd Qu.:16.0   3rd Qu.:748   3rd Qu.:11300000        
##  Max.   :39500000   Max.   :20.0   Max.   :900   Max.   :29100000        
##  commercial_assets_value luxury_assets_value bank_asset_value  
##  Min.   :       0        Min.   :  300000    Min.   :       0  
##  1st Qu.: 1300000        1st Qu.: 7500000    1st Qu.: 2300000  
##  Median : 3700000        Median :14600000    Median : 4600000  
##  Mean   : 4973155        Mean   :15126306    Mean   : 4976692  
##  3rd Qu.: 7600000        3rd Qu.:21700000    3rd Qu.: 7100000  
##  Max.   :19400000        Max.   :39200000    Max.   :14700000  
##     loan_status  
##   Approved:2656  
##   Rejected:1613  
##                  
##                  
##                  
## 

The summary() function provides a statistical summary of the entire dataset.

Visualizations

Data Pre-processing

We removed unnecessary whitespace from the column names.

Box plot of Number of Dependents and Loan Status

The box plot comparing the number of dependents and loan status reveals minimal variation. Both approved and rejected statuses exhibit similar distributions, with consistent median values. This suggests that the number of dependents has no significant impact on loan approval status.

Density plot of Loan Term based on Loan Status

The density plot highlights a clear relationship between loan status and loan term. Notably, applications with a term of up to 5 years show the highest approval rate, while those exceeding 5 years face more rejections. This pattern suggests a lending strategy that favors applicants able to repay quickly.

Scatter Plot of CIBIL Score vs Loan Amount

The scatter plot reveals a distinct relationship between CIBIL score and loan status that holds across loan amounts. Rejections are prominent in the CIBIL score range of 300-550, while approvals rise significantly beyond a CIBIL score of 550, even for loan amounts exceeding 35M.

Stacked Bar between Loan Status and Self Employment

The bar plot shows minimal differentiation in loan approval rates between self-employed and non-self-employed individuals, suggesting that self-employment may not be a decisive factor influencing loan approval.

Scatter Plot of Loan Amount vs Commercial assets value based on Loan Status

The scatter plot depicts a positive correlation between commercial asset value and loan amount, implying that larger loans align with higher asset values. Approved and rejected applications follow similar distributions, so approval rates appear comparable across levels of commercial asset value and loan amount.

Scatter Loan Amount vs Residential Assets Value based on Loan Status

The graph indicates a positive correlation between residential asset value and loan amount, implying that higher residential values correspond to increased loan amounts. Approved and rejected applications follow similar distributions, so approval rates appear comparable across levels of residential asset value.

Scatter Plot of Loan Amount vs Luxury assets value based on Loan Status

The scatter plot reveals a positive correlation between luxury asset value and loan amount, suggesting that an increase in luxury asset value corresponds to higher loan amounts, with elevated chances of approval.

Scatter Plot of Loan Amount vs Bank Assets Value

Based on the graph, there is a positive correlation between bank assets value and loan amount, indicating that as bank assets value increases, so does the loan amount.

Scatter Plot of Loan Amount vs Income Per Annum based on Loan Status

The scatter plot indicates a direct correlation between annual income and loan amount, implying higher income aligns with larger loans. Approval and rejection rates appear consistent across different income and loan amount levels.

Box plot of CIBIL Score vs Self Employed based on Loan Status

The box plot analysis indicates that CIBIL score significantly influences loan status, whereas self-employment status does not exhibit a notable impact.

Density Plot of Bank Assets grouped by Loan Status

The density plot shows nearly identical distributions of bank assets for approved and rejected applications, suggesting that loan approval does not depend on bank asset value.

Density Plot of CIBIL Score grouped by Loan Status

The density plot highlights that applicants with a higher CIBIL score (typically 500+) have a greater likelihood of loan approval, while applications with a score below that are mostly rejected. This underscores the crucial role of a good CIBIL score and its significant impact on loan applications.

Correlation Plot

The correlation plot shows a strong correlation between cibil_score and loan_status. Additionally, loan_amount is noticeably correlated with the various asset values, underscoring the importance of creditworthiness and financial factors in loan approval.

Statistical Tests

T-Test on Loan Status (Approval/Rejection) and Cibil Score

## 
##  Welch Two Sample t-test
## 
## data:  Approved$cibil_score and rejected$cibil_score
## t = 88, df = 4263, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  268 280
## sample estimates:
## mean of x mean of y 
##       703       429

Null Hypothesis (\(H_{0}\)): CIBIL score has no significant association with loan status.

Alternate Hypothesis (\(H_{A}\)): CIBIL score has significant association with loan status.

The p-value (\(< 2 \times 10^{-16}\)) is far below the standard significance level of 0.05; hence, we reject the null hypothesis and conclude that CIBIL score has a significant association with loan status.
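A Welch two-sample t-test of this kind can be reproduced with base R's t.test(). The sketch below uses synthetic scores rather than the loan data; the vector names, group sizes, means, and the seed are all assumptions chosen only to resemble the setting.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch repeatable

# Synthetic scores; the group means and spreads are invented for illustration.
approved_scores <- rnorm(200, mean = 700, sd = 100)
rejected_scores <- rnorm(150, mean = 430, sd = 80)

# t.test() applies the Welch correction by default (var.equal = FALSE).
welch <- t.test(approved_scores, rejected_scores)

welch$statistic  # t value
welch$parameter  # Welch-Satterthwaite degrees of freedom
welch$p.value    # compared against alpha = 0.05
welch$conf.int   # 95% interval for the difference in means
```

Because the Welch variant does not assume equal variances, it is a reasonable default when the two groups differ in size and spread, as the approved and rejected groups do here.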

Chi-squared test between Education and Loan Status

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  c
## X-squared = 0.08, df = 1, p-value = 0.8

Null Hypothesis (\(H_{0}\)): Education level and loan status are independent of each other.

Alternate Hypothesis (\(H_{A}\)): Education level and loan status are dependent on each other.

With a high p-value of \(0.772\), we fail to reject the null hypothesis. Consequently, we find no evidence that an applicant’s education level is associated with loan approval.
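The same kind of independence test can be sketched with chisq.test() on a 2 x 2 contingency table. The counts below are hypothetical and are not taken from the dataset; in practice the table would come from something like table(education, loan_status).

```r
# Hypothetical 2 x 2 counts; they are not taken from the dataset.
counts <- matrix(c(130, 120,
                    80,  70),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(education   = c("Graduate", "Not Graduate"),
                                 loan_status = c("Approved", "Rejected")))

# For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default.
result <- chisq.test(counts)

result$expected  # counts expected under independence
result$statistic # X-squared
result$p.value   # a large value means independence cannot be rejected
```

A large p-value only indicates that the data are compatible with independence; it does not prove that the two variables are unrelated.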

T-Test between Loan Status and Bank Asset Value

## 
##  Welch Two Sample t-test
## 
## data:  Approved$bank_asset_value and rejected$bank_asset_value
## t = -0.4, df = 3453, p-value = 0.7
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -245677  154809
## sample estimates:
## mean of x mean of y 
##   4959526   5004960

Null Hypothesis (\(H_{0}\)): Bank asset value and loan status are independent of each other.

Alternative Hypothesis (\(H_{A}\)): Bank asset value and loan status are dependent on each other.

Bank asset value and loan status have a high p-value of \(0.656\). Thus, we cannot reject the null hypothesis, and we find no significant association between bank asset value and loan status.

T-Test between Loan Status and Residential Assets Value

## 
##  Welch Two Sample t-test
## 
## data:  Approved$residential_assets_value and rejected$residential_assets_value
## t = -0.9, df = 3400, p-value = 0.3
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -595310  209937
## sample estimates:
## mean of x mean of y 
##   7399812   7592498

Null Hypothesis (\(H_{0}\)): There is no significant association between the values of residential asset and loan approval status.

Alternative Hypothesis (\(H_{A}\)): There is a significant association between the values of residential asset and loan approval status.

With a p-value of \(0.348\), we cannot reject the null hypothesis; we therefore find no significant association between residential assets value and loan status.

T-test between number of dependents and loan status

## 
##  Welch Two Sample t-test
## 
## data:  Approved$no_of_dependents and rejected$no_of_dependents
## t = -1, df = 3400, p-value = 0.2
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1683  0.0416
## sample estimates:
## mean of x mean of y 
##      2.47      2.54

Null Hypothesis (\(H_{0}\)): There is no significant association between the number of dependents and loan approval status.

Alternative Hypothesis (\(H_{A}\)): There is a significant association between the number of dependents and loan approval status.

With a p-value of \(0.237\), we fail to reject the null hypothesis. Therefore, we can conclude that there is no significant association between the number of dependents and loan status.

T-test between luxury assets value and loan status

## 
##  Welch Two Sample t-test
## 
## data:  a$luxury_assets_value and r$luxury_assets_value
## t = -1, df = 3442, p-value = 0.3
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -851752  271073
## sample estimates:
## mean of x mean of y 
##  15016604  15306944

Null Hypothesis (\(H_{0}\)): The luxury assets value of an applicant has no significant association with their loan status.

Alternate Hypothesis (\(H_{A}\)): The luxury assets value of an applicant has significant association with their loan status.

With a p-value of \(0.311\), we cannot reject the null hypothesis and thus, we conclude that the luxury assets value of an applicant has no significant association with their loan status.

## Pearson Correlation Coefficient: 0.00844
## p-value: 0.582

The low Pearson correlation coefficient \(0.008\) indicates a weak relationship between loan_term and loan_amount. Moreover, the high p-value \(0.582\) suggests the observed correlation is not statistically significant.
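A Pearson correlation test of this form can be sketched with cor.test(). The two vectors below are generated independently at random, so they only imitate the ranges of loan_term and loan_amount; the seed and sample size are assumptions.

```r
set.seed(2)  # arbitrary seed for repeatability

# Independent synthetic draws that imitate the ranges of the two variables.
loan_term   <- sample(seq(2, 20, by = 2), 300, replace = TRUE)
loan_amount <- runif(300, min = 3e5, max = 3.95e7)

ct <- cor.test(loan_term, loan_amount, method = "pearson")
ct$estimate  # Pearson correlation coefficient r
ct$p.value   # significance of the correlation

# The test statistic follows from r as t = r * sqrt(n - 2) / sqrt(1 - r^2).
r <- unname(ct$estimate)
n <- length(loan_term)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
```

The manual statistic shows why a coefficient close to zero yields a small t value and therefore a large p-value at a fixed sample size.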

Model Selection

Regression problem

## Reordering variables and trying again:

## Subset selection object
## Call: regsubsets.formula(loan_amount ~ ., data = data, nvmax = 10, 
##     nbest = 2, method = "exhaustive")
## 12 Variables  (and intercept)
##                          Forced in Forced out
## no_of_dependents             FALSE      FALSE
## educationNot Graduate        FALSE      FALSE
## self_employedNo              FALSE      FALSE
## income_annum                 FALSE      FALSE
## loan_term                    FALSE      FALSE
## cibil_score                  FALSE      FALSE
## residential_assets_value     FALSE      FALSE
## commercial_assets_value      FALSE      FALSE
## luxury_assets_value          FALSE      FALSE
## bank_asset_value             FALSE      FALSE
## loan_statusRejected          FALSE      FALSE
## self_employedOther           FALSE      FALSE
## 2 subsets of each size up to 11
## Selection Algorithm: exhaustive
##           no_of_dependents educationNot Graduate self_employedNo
## 1  ( 1 )  " "              " "                   " "            
## 1  ( 2 )  " "              " "                   " "            
## 2  ( 1 )  " "              " "                   " "            
## 2  ( 2 )  " "              " "                   " "            
## 3  ( 1 )  " "              " "                   " "            
## 3  ( 2 )  " "              " "                   " "            
## 4  ( 1 )  " "              " "                   " "            
## 4  ( 2 )  "*"              " "                   " "            
## 5  ( 1 )  "*"              " "                   " "            
## 5  ( 2 )  " "              " "                   " "            
## 6  ( 1 )  "*"              " "                   " "            
## 6  ( 2 )  "*"              " "                   " "            
## 7  ( 1 )  "*"              " "                   " "            
## 7  ( 2 )  "*"              " "                   " "            
## 8  ( 1 )  "*"              " "                   " "            
## 8  ( 2 )  "*"              " "                   " "            
## 9  ( 1 )  "*"              " "                   " "            
## 9  ( 2 )  "*"              " "                   "*"            
## 10  ( 1 ) "*"              " "                   "*"            
## 10  ( 2 ) "*"              "*"                   " "            
## 11  ( 1 ) "*"              "*"                   "*"            
## 11  ( 2 ) "*"              " "                   "*"            
##           self_employedOther income_annum loan_term cibil_score
## 1  ( 1 )  " "                "*"          " "       " "        
## 1  ( 2 )  " "                " "          " "       " "        
## 2  ( 1 )  " "                "*"          " "       " "        
## 2  ( 2 )  " "                "*"          " "       " "        
## 3  ( 1 )  " "                "*"          " "       "*"        
## 3  ( 2 )  " "                "*"          " "       " "        
## 4  ( 1 )  " "                "*"          " "       "*"        
## 4  ( 2 )  " "                "*"          " "       "*"        
## 5  ( 1 )  " "                "*"          " "       "*"        
## 5  ( 2 )  " "                "*"          "*"       "*"        
## 6  ( 1 )  " "                "*"          "*"       "*"        
## 6  ( 2 )  " "                "*"          " "       "*"        
## 7  ( 1 )  " "                "*"          "*"       "*"        
## 7  ( 2 )  " "                "*"          "*"       "*"        
## 8  ( 1 )  " "                "*"          "*"       "*"        
## 8  ( 2 )  " "                "*"          "*"       "*"        
## 9  ( 1 )  " "                "*"          "*"       "*"        
## 9  ( 2 )  " "                "*"          "*"       "*"        
## 10  ( 1 ) " "                "*"          "*"       "*"        
## 10  ( 2 ) " "                "*"          "*"       "*"        
## 11  ( 1 ) " "                "*"          "*"       "*"        
## 11  ( 2 ) "*"                "*"          "*"       "*"        
##           residential_assets_value commercial_assets_value luxury_assets_value
## 1  ( 1 )  " "                      " "                     " "                
## 1  ( 2 )  " "                      " "                     "*"                
## 2  ( 1 )  " "                      " "                     " "                
## 2  ( 2 )  " "                      "*"                     " "                
## 3  ( 1 )  " "                      " "                     " "                
## 3  ( 2 )  " "                      "*"                     " "                
## 4  ( 1 )  " "                      "*"                     " "                
## 4  ( 2 )  " "                      " "                     " "                
## 5  ( 1 )  " "                      "*"                     " "                
## 5  ( 2 )  " "                      "*"                     " "                
## 6  ( 1 )  " "                      "*"                     " "                
## 6  ( 2 )  "*"                      "*"                     " "                
## 7  ( 1 )  "*"                      "*"                     " "                
## 7  ( 2 )  " "                      "*"                     "*"                
## 8  ( 1 )  "*"                      "*"                     "*"                
## 8  ( 2 )  "*"                      "*"                     " "                
## 9  ( 1 )  "*"                      "*"                     "*"                
## 9  ( 2 )  "*"                      "*"                     "*"                
## 10  ( 1 ) "*"                      "*"                     "*"                
## 10  ( 2 ) "*"                      "*"                     "*"                
## 11  ( 1 ) "*"                      "*"                     "*"                
## 11  ( 2 ) "*"                      "*"                     "*"                
##           bank_asset_value loan_statusRejected
## 1  ( 1 )  " "              " "                
## 1  ( 2 )  " "              " "                
## 2  ( 1 )  " "              "*"                
## 2  ( 2 )  " "              " "                
## 3  ( 1 )  " "              "*"                
## 3  ( 2 )  " "              "*"                
## 4  ( 1 )  " "              "*"                
## 4  ( 2 )  " "              "*"                
## 5  ( 1 )  " "              "*"                
## 5  ( 2 )  " "              "*"                
## 6  ( 1 )  " "              "*"                
## 6  ( 2 )  " "              "*"                
## 7  ( 1 )  " "              "*"                
## 7  ( 2 )  " "              "*"                
## 8  ( 1 )  " "              "*"                
## 8  ( 2 )  "*"              "*"                
## 9  ( 1 )  "*"              "*"                
## 9  ( 2 )  " "              "*"                
## 10  ( 1 ) "*"              "*"                
## 10  ( 2 ) "*"              "*"                
## 11  ( 1 ) "*"              "*"                
## 11  ( 2 ) "*"              "*"

The regsubsets() function identifies the optimal subset of predictor variables by minimizing or maximizing a selected criterion such as Adjusted R-squared (adjr2), R-squared (r2), Bayesian Information Criterion (BIC), or Mallows’ Cp.
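To make the idea of exhaustive subset selection concrete, the sketch below enumerates every predictor subset in base R and scores each fit by adjusted R-squared and BIC. It is a plain stand-in for what regsubsets() automates, not a call to the leaps package, and the data, variable names, and coefficients are synthetic assumptions.

```r
set.seed(3)  # arbitrary seed for repeatability
n <- 200

# Synthetic predictors and response; only x1 and x2 truly drive y.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

predictors <- c("x1", "x2", "x3", "x4")

# Enumerate every non-empty subset of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset and record the selection criteria.
scores <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "y"), data = sim)
  data.frame(model = paste(vars, collapse = " + "),
             size  = length(vars),
             adjr2 = summary(fit)$adj.r.squared,
             bic   = BIC(fit))
}))

best_adjr2 <- scores[which.max(scores$adjr2), ]  # maximize adjusted R-squared
best_bic   <- scores[which.min(scores$bic), ]    # minimize BIC
best_adjr2
best_bic
```

BIC penalizes model size more heavily than adjusted R-squared, which is one reason the criteria can select different subsets.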

The best sets of predictors identified from each criterion plot are:

  • Adjusted R-squared: no_of_dependents, loan_term, income_annum, commercial_assets_value, cibil_score and loan_status.

  • R-squared: no_of_dependents, loan_term, income_annum, commercial_assets_value, cibil_score and loan_status.

  • BIC: no_of_dependents, loan_term, income_annum, residential_assets_value, commercial_assets_value, cibil_score and loan_status.

  • Mallows’ Cp: no_of_dependents, income_annum, residential_assets_value, commercial_assets_value, cibil_score and loan_status.

##                          Abbreviation
## no_of_dependents                    n
## educationNot Graduate              eG
## self_employedNo                   s_N
## income_annum                        i
## loan_term                         ln_
## cibil_score                       cb_
## residential_assets_value            r
## commercial_assets_value           c__
## luxury_assets_value               l__
## bank_asset_value                    b
## loan_statusRejected               l_R
## self_employedOther                s_O

From the Adjusted R-squared statistic plot, the most suitable set of predictors is found to be: no_of_dependents, loan_term, income_annum, commercial_assets_value, cibil_score, luxury_assets_value and loan_status.

##                          Abbreviation
## no_of_dependents                    n
## educationNot Graduate              eG
## self_employedNo                   s_N
## income_annum                        i
## loan_term                         ln_
## cibil_score                       cb_
## residential_assets_value            r
## commercial_assets_value           c__
## luxury_assets_value               l__
## bank_asset_value                    b
## loan_statusRejected               l_R
## self_employedOther                s_O

The most relevant predictors from the Mallows’ Cp plot are found to be no_of_dependents, income_annum, commercial_assets_value, cibil_score and loan_status.

Classification Problem

## Fitting algorithm:  AIC-glm
## Best Model:
##              df deviance
## Null Model 4262     1880
## Full Model 4268     5661
## 
##  likelihood-ratio test - GLM
## 
## data:  H0: Null Model vs. H1: Best Fit AIC-glm
## X = 3780, df = 6, p-value <2e-16
##   no_of_dependents education self_employed income_annum loan_amount loan_term
## 1            FALSE     FALSE         FALSE         TRUE        TRUE      TRUE
## 2            FALSE     FALSE         FALSE         TRUE        TRUE      TRUE
## 3            FALSE     FALSE         FALSE         TRUE        TRUE      TRUE
## 4            FALSE      TRUE         FALSE         TRUE        TRUE      TRUE
## 5            FALSE     FALSE         FALSE         TRUE        TRUE      TRUE
##   cibil_score residential_assets_value commercial_assets_value
## 1        TRUE                    FALSE                   FALSE
## 2        TRUE                    FALSE                   FALSE
## 3        TRUE                    FALSE                    TRUE
## 4        TRUE                    FALSE                   FALSE
## 5        TRUE                    FALSE                    TRUE
##   luxury_assets_value bank_asset_value Criterion
## 1                TRUE             TRUE      1892
## 2                TRUE            FALSE      1893
## 3                TRUE             TRUE      1893
## 4                TRUE             TRUE      1894
## 5                TRUE            FALSE      1894
##  no_of_dependents education       self_employed   income_annum   loan_amount   
##  Mode :logical    Mode :logical   Mode :logical   Mode:logical   Mode:logical  
##  FALSE:5          FALSE:4         FALSE:5         TRUE:5         TRUE:5        
##                   TRUE :1                                                      
##                                                                                
##                                                                                
##                                                                                
##  loan_term      cibil_score    residential_assets_value commercial_assets_value
##  Mode:logical   Mode:logical   Mode :logical            Mode :logical          
##  TRUE:5         TRUE:5         FALSE:5                  FALSE:3                
##                                                         TRUE :2                
##                                                                                
##                                                                                
##                                                                                
##  luxury_assets_value bank_asset_value   Criterion   
##  Mode:logical        Mode :logical    Min.   :1892  
##  TRUE:5              FALSE:2          1st Qu.:1893  
##                      TRUE :3          Median :1893  
##                                       Mean   :1893  
##                                       3rd Qu.:1894  
##                                       Max.   :1894

Model Creation

In our analysis, we utilize a dual approach, employing regression for predicting loan amount and classification for determining loan status. This enhances our model’s predictive capabilities in addressing different facets of the loan application process.

Train Test Split

The dataset was initially examined for the distribution of the target variable, loan_status (indicating loan approval). To promote model generalization, the data was later divided into training (80%) and test (20%) sets, ensuring reproducibility with a specified seed.
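An 80/20 split of this kind can be sketched with set.seed() and sample(). The seed value, the stand-in data frame, and the use of floor() are assumptions, since the report does not show the code it used; the row counts nevertheless match the sizes printed in the output.

```r
set.seed(123)  # hypothetical seed; the report does not state the value it used

# Stand-in data with the same number of rows as the dataset.
n_rows <- 4269
loans <- data.frame(
  id = seq_len(n_rows),
  loan_status = factor(sample(c("Approved", "Rejected"), n_rows,
                              replace = TRUE, prob = c(0.62, 0.38)))
)

# Draw 80% of the row indices for training; the remaining rows form the test set.
train_size <- floor(0.8 * n_rows)
train_idx  <- sample(seq_len(n_rows), size = train_size)
data_train <- loans[train_idx, ]
data_test  <- loans[-train_idx, ]

nrow(data_train)  # 3415
nrow(data_test)   # 854
prop.table(table(loans$loan_status))  # class balance of the target
```

Fixing the seed before sampling is what makes the split, and therefore every downstream metric, repeatable between runs.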

## 
## Approved Rejected 
##     2656     1613
## Rejected 
##    0.378
## [1] 0.8
## [1] 3415
## [1] 854

Regression

Linear Regression

A linear regression model was constructed using the lm() function, predicting loan_amount based on various features, including no_of_dependents, loan_term, income_annum, commercial_assets_value, cibil_score, and loan_status.

## 
## Call:
## lm(formula = loan_amount ~ no_of_dependents + loan_term + income_annum + 
##     commercial_assets_value + cibil_score + loan_status, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9773675 -1894729   -32406  2008070 10051244 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              2.02e+06   3.60e+05    5.62  2.0e-08 ***
## no_of_dependents        -4.88e+04   3.03e+04   -1.61    0.108    
## loan_term                9.14e+03   9.17e+03    1.00    0.319    
## income_annum             2.96e+00   2.39e-02  123.95  < 2e-16 ***
## commercial_assets_value  3.02e-02   1.53e-02    1.98    0.048 *  
## cibil_score             -2.51e+03   4.73e+02   -5.30  1.2e-07 ***
## loan_statusRejected     -1.26e+06   1.69e+05   -7.41  1.5e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3360000 on 4262 degrees of freedom
## Multiple R-squared:  0.862,  Adjusted R-squared:  0.862 
## F-statistic: 4.45e+03 on 6 and 4262 DF,  p-value: <2e-16

Summary of loan amount prediction

                          Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)               2.02e+06    3.60e+05    5.620    0.0000
no_of_dependents         -4.88e+04    3.03e+04   -1.610    0.1075
loan_term                 9.14e+03    9.17e+03    0.997    0.3190
income_annum              2.96e+00    2.39e-02  123.951    0.0000
commercial_assets_value   3.02e-02    1.53e-02    1.977    0.0481
cibil_score              -2.51e+03    4.73e+02   -5.303    0.0000
loan_statusRejected      -1.26e+06    1.69e+05   -7.413    0.0000

VIFs of the model

cibil_score  commercial_assets_value  income_annum  loan_statusRejected  loan_term  no_of_dependents
       2.52                      1.7           1.7                 2.55       1.04                 1

The summary statistics and variance inflation factors (VIFs) were then examined. All VIFs are below 3, which indicates no problematic multicollinearity among the predictors.
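
For reference, a VIF can be computed without extra packages as 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below uses simulated predictors with hypothetical names (x1, x2, x3); it illustrates the definition and is not the code used in the original analysis.

```r
# Simulated predictors (hypothetical names); x2 is deliberately
# correlated with x1, x3 is independent of both.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif_manual <- sapply(names(predictors), function(v) {
  fit <- lm(reformulate(setdiff(names(predictors), v), response = v),
            data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

print(round(vif_manual, 2))
```

A VIF of 1 means a predictor is uncorrelated with the rest; the correlated pair here gets values above 1, while the independent predictor stays close to 1.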

Results

## Training R-squared: 0.862
## Testing R-squared: 0.862

The scatter plot displays the actual versus predicted loan amounts, with a dashed red line denoting the ideal prediction scenario. The R-squared value is 0.862 on both the training and test sets, meaning the model explains about 86% of the variance in loan amount and performs equally well on the test data.
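
A sketch of how training and testing R-squared values can be computed, assuming simulated data whose column names mirror the report (the values do not). Here the model is fitted on the training rows only; the call printed above uses `data = data` with 4262 residual degrees of freedom, which suggests the reported model was fitted on all 4,269 rows.

```r
# Simulated stand-in data; column names mirror the report, values do not.
set.seed(42)
n <- 1000
sim <- data.frame(income_annum = runif(n, 2e5, 1e7))
sim$loan_amount <- 3 * sim$income_annum + rnorm(n, sd = 3e6)

train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows only, so the test score is out-of-sample.
fit <- lm(loan_amount ~ income_annum, data = train)

# R^2 = 1 - SSE / SST, computed against the mean of the evaluated set.
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

train_r2 <- r_squared(train$loan_amount, predict(fit, newdata = train))
test_r2  <- r_squared(test$loan_amount,  predict(fit, newdata = test))
cat("Training R-squared:", round(train_r2, 3), "\n")
cat("Testing R-squared:", round(test_r2, 3), "\n")
```

On the training rows this definition reproduces the R-squared reported by `summary(fit)`; similar values on both sets suggest the model is not overfitting.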

Classification

Logistic Regression

This study uses logistic regression to create a predictive model for loan status, leveraging the glm() function. The predictors are the number of dependents, annual income, loan amount, loan term, CIBIL score, luxury assets value, and bank assets value.

Model building

## 
## Call:
## glm(formula = loan_status ~ no_of_dependents + income_annum + 
##     loan_amount + loan_term + cibil_score + luxury_assets_value + 
##     bank_asset_value, family = "binomial", data = data_train)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          1.13e+01   4.76e-01   23.67  < 2e-16 ***
## no_of_dependents     6.60e-03   3.85e-02    0.17     0.86    
## income_annum         5.26e-07   9.54e-08    5.51  3.5e-08 ***
## loan_amount         -1.33e-07   1.99e-08   -6.72  1.9e-11 ***
## loan_term            1.49e-01   1.27e-02   11.72  < 2e-16 ***
## cibil_score         -2.46e-02   9.24e-04  -26.59  < 2e-16 ***
## luxury_assets_value -2.66e-08   1.97e-08   -1.35     0.18    
## bank_asset_value    -3.98e-08   3.63e-08   -1.10     0.27    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4528.9  on 3414  degrees of freedom
## Residual deviance: 1531.7  on 3407  degrees of freedom
## AIC: 1548
## 
## Number of Fisher Scoring iterations: 7

The coefficients table gives the estimated effect of each predictor on the log-odds of the modeled outcome. Because loan_status is a factor with levels “Approved” and “Rejected”, glm() treats the second level as the event, so the coefficients describe the log-odds of rejection. Key findings include:

The intercept is large and positive. cibil_score (z = -26.59) and loan_term (z = 11.72) have the strongest effects. income_annum and loan_amount are also highly significant (p < 0.001), although their coefficients look tiny because both variables are measured in single currency units. no_of_dependents, luxury_assets_value, and bank_asset_value are not statistically significant.
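
When reading the signs of the coefficients, it helps to check which outcome glm() actually models. For a factor response, R treats the first level as failure and all other levels as success. The sketch below demonstrates this on simulated data; the variable names mirror the report, but the values and the simulated relationship are assumptions.

```r
# Simulated data; names mirror the report, values are made up.
set.seed(7)
n <- 2000
cibil_score <- round(runif(n, 300, 900))

# Rejection is made more likely for low scores.
p_reject <- plogis(6 - 0.012 * cibil_score)
loan_status <- factor(ifelse(runif(n) < p_reject, "Rejected", "Approved"),
                      levels = c("Approved", "Rejected"))

fit <- glm(loan_status ~ cibil_score, family = binomial)

# The first level ("Approved") is treated as failure, so the fitted
# probabilities and coefficients refer to the second level ("Rejected").
print(levels(loan_status))
print(coef(fit))

# Mean fitted probability within each actual class.
mean_prob <- tapply(fitted(fit), loan_status, mean)
print(round(mean_prob, 3))
```

The negative cibil_score coefficient here means that higher scores lower the odds of rejection. The same reading applies to the model summary above, where the prediction column in the later data summary shows values near zero for approved applicants.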

Feature Importance

Summary of logistic regression for loan status

                     Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)           11.2729      0.4762   23.671     0.000
no_of_dependents       0.0066      0.0385    0.171     0.864
income_annum           0.0000      0.0000    5.512     0.000
loan_amount            0.0000      0.0000   -6.715     0.000
loan_term              0.1491      0.0127   11.718     0.000
cibil_score           -0.0246      0.0009  -26.587     0.000
luxury_assets_value    0.0000      0.0000   -1.353     0.176
bank_asset_value       0.0000      0.0000   -1.097     0.273

Exponential of coefficients in Logit Reg

                            x
(Intercept)          7.87e+04
no_of_dependents     1.01e+00
income_annum         1.00e+00
loan_amount          1.00e+00
loan_term            1.16e+00
cibil_score          9.76e-01
luxury_assets_value  1.00e+00
bank_asset_value     1.00e+00

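Feature importance summary (the exponentiated coefficients are odds ratios for the modeled outcome, which for this factor coding is “Rejected”):

  • The intercept is notably high; it sets the baseline odds when all predictors are zero and has no direct practical interpretation.
  • Number of dependents has minimal impact (non-significant).
  • Annual income and loan amount have odds ratios that round to 1.00 only because they are expressed per single currency unit. Both effects are statistically significant: holding the other predictors fixed, higher income raises and a larger loan amount lowers the odds of rejection in this model.
  • Loan term has a substantial effect: each additional unit of term multiplies the odds of rejection by about 1.16 (a 16% increase).
  • Higher CIBIL scores correspond to lower odds of rejection, i.e. higher odds of approval (odds ratio 0.976 per point).
  • Luxury assets and bank assets show minimal impact (non-significant).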
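
The odds ratios can be reproduced, and rescaled to more readable step sizes, directly from the coefficients printed in the glm() summary. In the sketch below, the step sizes (100 CIBIL points, one million currency units) are illustrative choices and not part of the original analysis. The odds are those of the outcome glm() models, which is the second factor level.

```r
# Coefficients copied from the glm() summary above.
coefs <- c(loan_term = 1.49e-01, cibil_score = -2.46e-02,
           income_annum = 5.26e-07, loan_amount = -1.33e-07)

# Odds ratio for a one-unit change in each predictor.
or_per_unit <- exp(coefs)

# Odds ratios for more interpretable step sizes (the step sizes are
# illustrative choices, not part of the original analysis).
steps <- c(loan_term = 1, cibil_score = 100,
           income_annum = 1e6, loan_amount = 1e6)
or_per_step <- exp(coefs * steps[names(coefs)])

print(round(or_per_unit, 3))
print(round(or_per_step, 3))
```

Per single unit, income and loan amount give odds ratios indistinguishable from 1. Per million, they come to roughly 1.69 and 0.88, and a 100-point increase in CIBIL score multiplies the modeled odds by roughly 0.085.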

Data imbalance

  • The data imbalance graph shows an uneven class distribution: 2,656 approved versus 1,613 rejected applications (about 62% and 38%). Sole reliance on accuracy and ROC scores may therefore be insufficient for our predictive analysis, so precision and recall are reported as well.

Train and test metrics

## [1] "Training Accuracy: 91.45 %"
## [1] "Test Accuracy: 93.09 %"
## [1] "Training Precision: 88.92 %"
## [1] "Training Recall: 88.51 %"
## [1] "Test Precision: 90.68 %"
## [1] "Test Recall: 90.97 %"

In the logistic regression model, the following performance metrics were observed:

  • Training Accuracy: 91.45%
  • Test Accuracy: 93.09%
  • Training Precision: 88.92%
  • Training Recall: 88.51%
  • Test Precision: 90.68%
  • Test Recall: 90.97%

These metrics reveal the model’s proficiency in predicting loan approval status with high accuracy and precision. Balanced recall scores suggest effective capture of both approved and rejected instances, showcasing the logistic regression model’s robust performance on training and test sets.

Confusion matrix

Confusion matrix from Logit Model

                 Predicted Approved  Predicted Rejected  Total
Actual Approved                1979                 145   2124
Actual Rejected                 150                1141   1291
Total                          2129                1286   3415

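This confusion matrix details the model’s predictions for Approved and Rejected cases on the 3,415 training records. With Approved as the positive class, it contains 1,979 true positives, 145 false negatives (approved loans predicted as rejected), 150 false positives (rejected loans predicted as approved), and 1,141 true negatives. These counts are the basis for evaluation measures like precision, recall, and accuracy.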

Receiver-Operator-Characteristic (ROC) curve and Area-Under-Curve (AUC)

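The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) across classification thresholds, and the AUC summarizes the curve as a single number. The AUC ranges from 0 to 1, where 0.5 corresponds to random guessing and 1 to perfect separation. Values higher than 0.8 are generally considered to indicate a good model fit.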
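
The AUC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The base-R sketch below computes it from that rank-based identity; the function name `auc_rank` and the toy scores are hypothetical and only serve as a hand-checkable illustration.

```r
# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case receives a higher score than a negative one.
auc_rank <- function(scores, is_positive) {
  r <- rank(scores)  # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny hand-checkable example: 3 of the 4 positive/negative pairs are
# ordered correctly, so the AUC is 0.75.
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(FALSE, FALSE, TRUE, TRUE)
print(auc_rank(scores, labels))

# Reversing the scores gives a worse-than-random ranking (AUC below 0.5).
print(auc_rank(-scores, labels))
```

The second call returns 0.25, which shows that AUC values below 0.5 are possible when a model ranks cases in the wrong direction.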

McFadden pseudo R-squared

## 'log Lik.' 0.729 (df=8)

McFadden’s pseudo R-squared for the logistic regression is 0.729. The “log Lik.” label and “df=8” in the output above are carried over from the log-likelihood objects used in the calculation (8 is the number of estimated parameters); the printed number is the pseudo R-squared itself, not a log-likelihood. McFadden’s measure compares the log-likelihood of the fitted model with that of an intercept-only model, and a higher value indicates a better fit relative to that baseline.
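
A sketch of the calculation on simulated data (the data and object names are hypothetical; only the formula is the point). McFadden’s pseudo R-squared is 1 minus the ratio of the fitted model’s log-likelihood to that of the intercept-only model.

```r
# Simulated data (hypothetical); the point is the formula, not the numbers.
set.seed(11)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))

fit  <- glm(y ~ x, family = binomial)
null <- glm(y ~ 1, family = binomial)

# McFadden's pseudo R-squared: 1 - logLik(model) / logLik(null).
# Without as.numeric(), the result keeps the "logLik" class and prints
# with a 'log Lik.' label and the model's df.
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))

# For ungrouped 0/1 data the same value follows from the deviances.
mcfadden_dev <- 1 - fit$deviance / fit$null.deviance

print(round(c(mcfadden, mcfadden_dev), 3))
```

Wrapping the log-likelihoods in `as.numeric()` yields a plain number, which avoids the misleading “log Lik.” label in the printed result.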

Data preprocessing

Factoring the data

## 'data.frame':    3415 obs. of  14 variables:
##  $ no_of_dependents        : int  3 3 2 3 5 4 3 3 3 5 ...
##  $ education               : Factor w/ 2 levels "Graduate","Not Graduate": 2 1 1 1 1 1 2 1 2 2 ...
##  $ self_employed           : num  2 1 2 2 1 2 2 2 1 1 ...
##  $ income_annum            : int  1600000 6700000 4700000 8000000 4200000 8700000 3800000 5500000 6200000 2700000 ...
##  $ loan_amount             : int  3900000 13400000 14000000 26200000 9400000 32100000 11400000 11600000 14600000 6800000 ...
##  $ loan_term               : int  20 14 12 16 6 8 4 6 2 6 ...
##  $ cibil_score             : int  804 782 784 890 678 397 323 745 664 461 ...
##  $ residential_assets_value: int  900000 4900000 13400000 15800000 4700000 8900000 5800000 9400000 18300000 4500000 ...
##  $ commercial_assets_value : int  2500000 4000000 2700000 4300000 5500000 17000000 4000000 8000000 8400000 2000000 ...
##  $ luxury_assets_value     : int  3200000 25800000 14800000 25000000 9900000 34700000 9600000 12800000 16800000 8200000 ...
##  $ bank_asset_value        : int  800000 9600000 6900000 4000000 5700000 4800000 4400000 5400000 6600000 1700000 ...
##  $ loan_status             : Factor w/ 2 levels "Approved","Rejected": 1 1 1 1 1 2 1 1 1 2 ...
##  $ prediction              : num  0.00512 0.0057 0.00193 0.00025 0.01819 ...
##  $ prob                    : num  0.00512 0.0057 0.00193 0.00025 0.01819 ...

KNN Model creation

Finding the best K value

##  num [1:2, 1:11] 1 0.542 3 0.573 5 ...

The above analysis delineates the influence of the number of neighbors (k) on the accuracy of loan status predictions. The accuracy vs. k chart reveals that selecting k=18 yields the highest accuracy in this particular kNN model.
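
A sketch of a k search with `knn()` from the `class` package, on simulated data. The feature names, the grid of odd k values, and the standardisation step are assumptions: the report does not state whether features were scaled, and scaling is included here because kNN distances are otherwise dominated by variables with large units, such as income.

```r
library(class)  # provides knn(); assumed to be installed

# Simulated two-class data with features on very different scales.
set.seed(3)
n <- 600
status <- factor(sample(c("Approved", "Rejected"), n, replace = TRUE))
score  <- rnorm(n, mean = ifelse(status == "Approved", 650, 500), sd = 80)
income <- runif(n, 2e5, 1e7)  # uninformative, but numerically huge
X <- cbind(score, income)

train_idx <- sample(seq_len(n), size = floor(0.8 * n))

# Standardise with training-set statistics so that no feature dominates
# the Euclidean distance merely because of its units.
mu  <- colMeans(X[train_idx, ])
sdv <- apply(X[train_idx, ], 2, sd)
X_train <- scale(X[train_idx, ], center = mu, scale = sdv)
X_test  <- scale(X[-train_idx, ], center = mu, scale = sdv)

# Held-out accuracy for a grid of odd k values (an illustrative grid).
k_grid <- seq(1, 21, by = 2)
accuracy <- sapply(k_grid, function(k) {
  pred <- knn(X_train, X_test, cl = status[train_idx], k = k)
  mean(pred == status[-train_idx])
})

best_k <- k_grid[which.max(accuracy)]
print(rbind(k = k_grid, accuracy = round(accuracy, 3)))
print(best_k)
```

In practice, k would preferably be chosen on a separate validation split or by cross-validation, so that the test set remains untouched for the final evaluation.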

KNN Evaluation metrics

For training

##           Actual
## Predicted  Approved Rejected
##   Approved     1855      980
##   Rejected      269      311
## [1] "Training Accuracy: 63.43 %"

The confusion matrix for the training data shows that, out of 3,415 instances, the model correctly predicted 1,855 approved and 311 rejected cases, giving a training accuracy of 63.43%.

For testing

##  Factor w/ 2 levels "Approved","Rejected": 1 1 1 1 1 2 1 1 2 2 ...
## [1] 854
## bank_18NN
## Approved Rejected 
##      704      150
## [1] "Test Accuracy: 61.12 %"

On the test data, the kNN model with k=18 achieves an accuracy of approximately 61.12% in predicting loan approval status.

Confusion matrix

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  854 
## 
##  
##                            | bank_18NN 
## data_test[, "loan_status"] |  Approved |  Rejected | Row Total | 
## ---------------------------|-----------|-----------|-----------|
##                   Approved |       452 |        80 |       532 | 
##                            |     0.850 |     0.150 |     0.623 | 
##                            |     0.642 |     0.533 |           | 
##                            |     0.529 |     0.094 |           | 
## ---------------------------|-----------|-----------|-----------|
##                   Rejected |       252 |        70 |       322 | 
##                            |     0.783 |     0.217 |     0.377 | 
##                            |     0.358 |     0.467 |           | 
##                            |     0.295 |     0.082 |           | 
## ---------------------------|-----------|-----------|-----------|
##               Column Total |       704 |       150 |       854 | 
##                            |     0.824 |     0.176 |           | 
## ---------------------------|-----------|-----------|-----------|
## 
## 

The k-Nearest Neighbor (kNN) model achieved around 61.12% accuracy on the test dataset. Among 854 instances, it correctly predicted 452 of the 532 “Approved” loans (a recall of 85.0%), while only 452 of its 704 “Approved” predictions were correct (a precision of 64.2%). It correctly classified just 70 of the 322 “Rejected” loans (21.7%); of its 150 “Rejected” predictions, 46.7% were correct.

Testing results

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Approved Rejected
##   Approved      452      252
##   Rejected       80       70
##                                         
##                Accuracy : 0.611         
##                  95% CI : (0.578, 0.644)
##     No Information Rate : 0.623         
##     P-Value [Acc > NIR] : 0.771         
##                                         
##                   Kappa : 0.075         
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.850         
##             Specificity : 0.217         
##          Pos Pred Value : 0.642         
##          Neg Pred Value : 0.467         
##              Prevalence : 0.623         
##          Detection Rate : 0.529         
##    Detection Prevalence : 0.824         
##       Balanced Accuracy : 0.534         
##                                         
##        'Positive' Class : Approved      
## 
## [1] "Precision: 0.64"
## [1] "Recall: 0.85"

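The k-Nearest Neighbor model identifies “Approved” loans with a precision of 64% and a recall of 85%, but it misses most rejections (specificity 0.217). Its accuracy (0.611) is below the no-information rate (0.623), and the kappa of 0.075 indicates only slight agreement beyond chance.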
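
The reported statistics can be recomputed by hand from the test-set confusion matrix. The sketch below copies the four cell counts from the output above and treats “Approved” as the positive class; the variable names are illustrative.

```r
# Test-set confusion matrix copied from the output above
# (rows = predicted, columns = actual); "Approved" is the positive class.
cm <- matrix(c(452, 80, 252, 70), nrow = 2,
             dimnames = list(Predicted = c("Approved", "Rejected"),
                             Actual    = c("Approved", "Rejected")))

tp <- cm["Approved", "Approved"]; fn <- cm["Rejected", "Approved"]
fp <- cm["Approved", "Rejected"]; tn <- cm["Rejected", "Rejected"]
n  <- sum(cm)

accuracy    <- (tp + tn) / n
precision   <- tp / (tp + fp)        # positive predictive value
recall      <- tp / (tp + fn)        # sensitivity
specificity <- tn / (tn + fp)
nir         <- max(colSums(cm)) / n  # no-information rate

# Cohen's kappa: agreement beyond what the marginals alone would give.
p_expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - p_expected) / (1 - p_expected)

print(round(c(accuracy = accuracy, precision = precision, recall = recall,
              specificity = specificity, nir = nir, kappa = cohen_kappa), 3))
```

The no-information rate is the accuracy obtained by always predicting the majority class, so an accuracy below it means the classifier adds no value over that trivial rule.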

AUC ROC curve for KNN

With an AUC of 0.534 and test/train accuracies of 61.1%/63.43%, the model performs barely better than chance. We therefore explore alternative non-parametric models for better predictive performance.

Decision Tree model

Training

## [1] "Train Accuracy: 96.81 %"

Testing

## [1] "Test Accuracy: 96.49 %"

Testing results

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Approved Rejected
##   Approved      529       27
##   Rejected        3      295
##                                        
##                Accuracy : 0.965        
##                  95% CI : (0.95, 0.976)
##     No Information Rate : 0.623        
##     P-Value [Acc > NIR] : < 2e-16      
##                                        
##                   Kappa : 0.924        
##                                        
##  Mcnemar's Test P-Value : 2.68e-05     
##                                        
##             Sensitivity : 0.994        
##             Specificity : 0.916        
##          Pos Pred Value : 0.951        
##          Neg Pred Value : 0.990        
##              Prevalence : 0.623        
##          Detection Rate : 0.619        
##    Detection Prevalence : 0.651        
##       Balanced Accuracy : 0.955        
##                                        
##        'Positive' Class : Approved     
## 
## [1] "Precision: 0.95"
## [1] "Recall: 0.99"

Based on the model results, we obtain a precision and recall score of 95% and 99% respectively.

AUC ROC curve of decision trees

## AUC: 0.955

We obtain the above AUC-ROC curve with an area under the curve value of ~95.5%, indicating that this model is a good fit for our data.

AUC Scores

## [1] "AUC score of KNN 0.533507682249101"
## [1] "AUC score of decision Tress 0.95525498528931"
## [1] "AUC score of Logistic regressor 0.96766802184032"

So far, of the three models we have implemented, Logistic Regression is the best performer with an AUC score of ~96.8%.

Random forest

## 
## Call:
##  randomForest(formula = loan_status ~ no_of_dependents + income_annum +      loan_amount + loan_term + cibil_score + luxury_assets_value +      bank_asset_value, data = data_train, ntree = 500, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 1.58%
## Confusion matrix:
##          Approved Rejected class.error
## Approved     2108       16     0.00753
## Rejected       38     1253     0.02943
  • The confusion matrix indicates strong performance in predicting both “Approved” and “Rejected” classes.
  • Class error rates are low, with 0.75% for “Approved” and 2.94% for “Rejected,” showcasing accurate predictions.
  • The model’s ability to correctly identify instances is evident from the high numbers on the diagonal of the confusion matrix.

Overall, the random forest model proves effective in classifying loan applications based on provided features, demonstrating a low out-of-bag error rate.

Feature Importance Summary

##                     Approved Rejected MeanDecreaseAccuracy MeanDecreaseGini
## no_of_dependents        1.33     4.34                 3.74             15.8
## income_annum           15.75    14.63                21.43             37.2
## loan_amount            26.23    17.25                32.05             58.8
## loan_term              87.47    80.78               103.77             95.8
## cibil_score           383.23   399.68               440.78           1328.5
## luxury_assets_value    12.77    10.13                16.37             38.7
## bank_asset_value       10.79     9.59                14.86             29.8
  1. CIBIL Score (cibil_score):
    • Highest importance by a wide margin in both accuracy and Gini impurity reduction.
  2. Loan Term (loan_term):
    • Second most important for predictive accuracy and reducing impurity.
  3. Loan Amount (loan_amount):
    • Shows noticeable importance in both metrics.
  4. Income (income_annum) and Luxury Assets (luxury_assets_value):
    • Moderately important.
  5. Bank Asset Value (bank_asset_value):
    • Relatively lower importance.
  6. Number of Dependents (no_of_dependents):
    • Appears least impactful on model performance.
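
The ranking can be read directly off the importance table. The sketch below copies the two summary columns from the output above into a data frame (the object names are illustrative) and sorts the features by each measure.

```r
# Importance values copied from the randomForest output above.
importance_tbl <- data.frame(
  feature = c("no_of_dependents", "income_annum", "loan_amount", "loan_term",
              "cibil_score", "luxury_assets_value", "bank_asset_value"),
  MeanDecreaseAccuracy = c(3.74, 21.43, 32.05, 103.77, 440.78, 16.37, 14.86),
  MeanDecreaseGini     = c(15.8, 37.2, 58.8, 95.8, 1328.5, 38.7, 29.8)
)

# Rank features by each measure (largest first).
by_accuracy <- importance_tbl$feature[order(-importance_tbl$MeanDecreaseAccuracy)]
by_gini     <- importance_tbl$feature[order(-importance_tbl$MeanDecreaseGini)]

print(by_accuracy)
print(by_gini)
```

Both measures place cibil_score first, followed by loan_term and loan_amount, and both place no_of_dependents last. They differ only in the order of income_annum and luxury_assets_value.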

Model metrics random forest

Training and testing accuracy

## [1] "Train Accuracy: 100 %"
## [1] "Test Accuracy: 98.59 %"
## [1] "Precision: 0.97"
## [1] "Recall: 0.99"

  • The model attained 100% accuracy on the training set, which is common for random forests and does not by itself indicate generalization. The accuracy of 98.59% on the test data, however, indicates strong generalization to new instances.

  • The scores show high precision (0.97) in correctly identifying positives and strong recall (0.99) in capturing most actual positives. The model performs well on both, showcasing its effectiveness in loan application classification.

AUC ROC Curve

## AUC: 0.983

The random forest model achieved an Area Under the Curve (AUC) score of 98.3%.

An AUC score nearing 1 signifies strong discriminatory power, indicating the model excels in distinguishing between classes. The high AUC underscores the random forest model’s robustness in classifying loan applications.

AUC Scores

## [1] "AUC score of KNN 0.533507682249101"
## [1] "AUC score of decision Tress 0.95525498528931"
## [1] "AUC score of Logistic regressor 0.96766802184032"
## [1] "AUC score of Random Forest 0.983205295848316"

Model Result chart

  • Random Forest: Achieving the highest AUC score of 98.321%, the Random Forest model demonstrates superior discriminative capability. This underscores its suitability for the given classification task.

  • Logistic Regression: With an AUC score of 96.767%, Logistic Regression performs admirably and proves to be a robust model for the task.

  • Decision Tree: The Decision Tree model yields a respectable AUC score of 95.525%, positioning it as a viable choice for classification.

  • KNN: KNN lags far behind the other models with an AUC score of 53.351%, only marginally better than random guessing (50%), so it is not a competitive choice for this task in its current form.

Conclusion

In summary, our project examines a Kaggle dataset to understand the factors behind individual loan approval. Drawing from our experiences as international students, we provide insights into the expanding lending landscape. Our results underscore the pivotal role of the CIBIL score, and our comparison of several prediction models shows that Random Forest performs best (AUC of about 98.3%). We advise applicants to prioritize their credit scores and align preferences with financial goals. Simultaneously, lending institutions can bolster decision-making efficiency. Overall, our work enhances the comprehension of loan dynamics, offering advantages to lenders and borrowers globally.

References

[1] Sheikh, M. A., Goel, A. K., & Kumar, T. (2020, July). An approach for prediction of loan approval using machine learning algorithm. In 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC) (pp. 490-494). IEEE.

[2] Ndayisenga, T. (2021). Bank loan approval prediction using machine learning techniques (Doctoral dissertation).

[3] Murthy, P. S., Shekar, G. S., Rohith, P., & Reddy, G. V. V. (2020). Loan Approval Prediction System Using Machine Learning. Journal of Innovation in Information Technology, 4(1), 21-24.